We have developed a generally applicable experimental
procedure to find functional proteins that are many
mutational steps from wild type. Optimization algorithms,
which are typically used to search for solutions to certain
combinatorial problems, have been adapted to the problem
of searching the 'sequence space' of proteins. Many of the
steps normally performed by a digital computer are embodied
in this new molecular genetics technique, termed recursive
ensemble mutagenesis (REM). REM uses information gained
from previous iterations of combinatorial cassette mutagenesis
(CCM) to search sequence space more efficiently. We have
used REM to simultaneously mutate six amino acid residues
in a model protein. As compared to conventional CCM, one
iteration of REM yielded a 30-fold increase in the frequency
of 'positive' mutants. Since a multiplicative factor of similar
magnitude is expected for the mutagenesis of additional sets
of six residues, performing REM on 18 sites is expected to
yield an exponential (30 000-fold) increase in the throughput
of positive mutants as compared to random [NN(G,C)]18
mutagenesis.

Key words: light harvesting II/protein engineering/random
mutagenesis



Introduction 

Current endeavors to engineer new specificities in antibodies and
their derivatives hold the promise of new therapeutic and
diagnostic tools. The generation of new and informative mutant
proteins is necessary to our understanding of protein
structure-function relationships. Such tasks are made difficult
by our inability to predict structure from primary sequence or
even to predict function from structure. One strategy circumventing
the gaps in our understanding involves the selection of
desired phenotypes from a large pool of different genotypes, in
a manner analogous to natural selection. A limitation of this is
the combinatorial explosion problem: as the number of
randomized (mutated with all 20 amino acids) sites in a protein
increases, the number of possible combinations which must be
evaluated to identify 'positives' grows exponentially as 20^n,
where n is the number of sites altered. Ingenious methods have
been devised to allow screening of increasingly complex libraries
of mutant proteins, peptides and oligonucleotides. Phage display
libraries (Smith, 1985; Hoogenboom et al., 1991; Kang et al.,
1991) and mutated ribozyme populations (Beaudry and Joyce,
1992) are instances of 'systems' where the genotypes and
phenotypes are physically linked to allow for rapid selection and
amplification of extremely complex ensembles of mutants. To
completely screen a library of mutant proteins with 20 randomized
amino acid residues (n = 20), the synthesis of 20^20 or 10^26
different protein molecules is required. Obviously, this will
challenge our technical capabilities for some time. It may be
desirable to avoid the very high proportion of non-functional
proteins in a random library and simply enhance the frequency
of functional proteins, thus decreasing the complexity required
to achieve a useful sampling of sequence space. Recursive
ensemble mutagenesis (REM) is an algorithm which enhances
the frequency of functional mutants in a library when an
appropriate selection or screening method is employed (Arkin
and Youvan, 1992a; Youvan et al., 1992).
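
The growth of 20^n described above can be checked with a few
lines of arithmetic. The sketch below is only an illustration; the
function name library_size is an assumption introduced here and
does not come from the original work.

```python
def library_size(n_sites: int, alphabet: int = 20) -> int:
    """Number of distinct sequences when n_sites positions are fully randomized."""
    return alphabet ** n_sites


# Protein-level library sizes for a few numbers of randomized sites.
for n in (1, 2, 6, 18, 20):
    print(f"n = {n:2d}: 20^n = {library_size(n):.3e}")
```

For n = 20 the result is about 1.05 × 10^26, in line with the
10^26 molecules quoted in the text.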

REM uses successive rounds of CCM (Oliphant et al., 1986;
Reidhaar-Olson et al., 1991) to generate a diverse library of
genetically altered proteins that fit certain selection criteria (Figure
1). Amino acids are retained in the library if they are found in
an altered protein fitting the selection criteria. Lists of all amino
acids that are acceptable at each mutated position (i.e. 'target
sets' of amino acids) are compiled. In the next iteration of REM,
combinatorial cassettes are resynthesized according to
mathematical functions that bias the nucleotide mixtures (Arkin
and Youvan, 1992b; Youvan et al., 1992) at each mutated
position in the protein to encode these target sets of amino acids.
For example, if Ala, Ser and Thr occur at a given position in
different selected mutants, these amino acids constitute the target
set at that position. A mathematical function is used to select the
best 'dope' that maximizes the probabilities of the amino acids
in the target set. The next cassette is then designed such that this
target set is encoded by a simple mixture of nucleotides at that
codon (e.g. [(G,A,T)(C)(G,C)]). In certain cases, where there
is a good match between selection criteria and structure inherent
in the genetic code (Sjostrom and Wold, 1985; Youvan, 1991)
such as hydropathy and molar volume, computer simulations
predict that multiple iterations of REM will yield thousands of
times more mutants than conventional CCM (Arkin and Youvan,
1992b; Youvan et al., 1992).

Fig. 1. REM involves the recursive use of combinatorial cassette
mutagenesis (CCM). The first step of REM begins by expressing and
screening a CCM library. Two or more 'positive' mutants are then picked
and sequenced. (Positive mutants are defined in the current experiment as
binding significant levels of red-shifted Bchl, which is characteristic of LHII
assembly.) Next, a list of unique protein sequences is determined by
translating these DNA sequences. A 'unique sequence' is defined at the
protein level. If more than one protein has the same sequence, only the first
occurrence of this sequence is retained and counted as unique. For each
mutated position in the protein, a target set of acceptable amino acids is
compiled and the most appropriate dope is determined by a mathematical
function such as group probability (P_G). The next iteration of REM
proceeds by using these 'intelligent' dopes to generate a combinatorial
cassette of lower complexity. In order to take advantage of the properties of
REM, the complexity of the possible peptide sequences arising from CCM
should be shown to be in vast excess of the screening size (Youvan et al.,
1992).
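
The compilation of target sets from sequenced positives can be
sketched as follows. This is a minimal illustration, assuming
protein sequences are given as equal-length strings of one-letter
codes covering only the mutated positions. The function names
and the example sequences are hypothetical and are not data from
this study.

```python
def unique_sequences(sequences):
    """Keep only the first occurrence of each protein sequence, in order."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


def target_sets(sequences):
    """Return, for each mutated position, the set of amino acids observed there."""
    unique = unique_sequences(sequences)
    if not unique:
        return []
    length = len(unique[0])
    if any(len(seq) != length for seq in unique):
        raise ValueError("all sequences must cover the same mutated positions")
    return [{seq[pos] for seq in unique} for pos in range(length)]


# Hypothetical positives over three mutated positions (illustration only).
positives = ["AGL", "SGL", "TGV", "AGL"]
for pos, amino_acids in enumerate(target_sets(positives), start=1):
    print(pos, sorted(amino_acids))
```

In this toy input the first position yields the target set Ala,
Ser, Thr used as the example in the text, and the duplicate
sequence is counted only once.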

As a model system to experimentally verify the computer
predicted amplification by REM, the light harvesting II (LHII)
β-subunit gene (Youvan and Ismail, 1985) of Rhodobacter
capsulatus was chosen. The LHII protein has two characteristic
absorption bands in the near infrared (800 and 858 nm) that are
red shifted relative to protein-free bacteriochlorophyll (Bchl)
absorption at 770 nm. These prosthetic groups serve as colorimetric
indicators of protein expression and subunit assembly. Six
carboxy-terminal residues of the β-subunit were initially mutated
by construction of a combinatorial cassette containing the
sequence [NN(G,C)]6, where 'N' designates an equiprobable
mixture of all four nucleotides. This CCM library was conjugated
into a strain of R.capsulatus (U71) totally deficient in Bchl-binding
proteins or any other compounds with significant absorption
in the near infrared (Youvan et al., 1985). This deletion
background facilitates the use of digital imaging spectroscopy
(DIS) (Arkin et al., 1990; Arkin and Youvan, 1993) to screen
thousands of colonies directly on Petri dishes for LHII expression.
We then sequenced five functional mutants and used this limited
data to construct a new CCM library. The frequency of positives
was increased 30-fold relative to the original library.

Materials and methods 

Plasmids and strains 

Plasmid pU4b is a shuttle vector used for cassette mutagenesis 
as well as expression of the mutant LHII genes (Goldman and 
Youvan, 1992). M13 was our vector for single-stranded 
sequencing and was propagated in Escherichia coli MV1190.
Escherichia coli strain S17-1 was used for library construction
and conjugation with R.capsulatus U71. For expression of the
libraries, R.capsulatus U71, an LHII chromosomal deletion
background (LHI and reaction center expression inactivated by
a point mutation) was used.

Materials and DNA manipulations 

DNA manipulations were essentially performed as described by
Sambrook et al. (1989). Restriction enzymes were obtained from
New England Biolabs; T4 DNA ligase was from Bethesda
Research Labs, as was Taq polymerase. Sequencing was carried
out using a Sequenase kit from United States Biochemicals.
Electroporation was carried out in 0.2 cm cuvettes on 0.45 ml
of competent cells using a Bio-Rad electroporator according to
instructions provided. All oligonucleotides were synthesized on
an Applied Biosystems model 381 DNA synthesizer using
commercially available reagents.

Library construction

The unique KpnI and XhoI sites of pU4b flank the region encoding
the dimer Bchl binding site and the carboxy-terminus of the
β-subunit LHII gene. These restriction sites were engineered to
allow double-stranded combinatorial cassettes to be subcloned
in place of the wild type sequence.

The sense strand of the 113-mer, which included the
KpnI-XhoI sites, as well as two PCR primers (20-mers each
spanning a restriction site) were synthesized. The doped sequence
within the cassette used in the zero iteration was [NN(G,C)]6.
The purified 113-mer was amplified by PCR. Amplified
double-stranded cassette was then purified by phenol extraction and
ethanol precipitation. Complete digestion of the cassette with KpnI
and XhoI is carried out in a single incubation. The digested
cassette is then purified by phenol and ether extractions and
ultrafiltration in a Centricon 30 device (Amicon).

Ligation is carried out for 24 h at trVO in 20 µl with
approximately 0.1 µg of pU4b similarly digested with KpnI and
XhoI. The resulting pU4b derivatives (an aliquot of the ligation)
are directly electroporated into S17-1 E.coli. Aliquots of the
transformation are plated on LB-tetracycline plates (after allowing 1 h
for resistance expression) for complexity estimation and the
remainder of the transformation is incubated overnight in 60 ml
of LB-tetracycline. Plasmid pU4b derivatives were conjugated
from E.coli S17-1 donors into R.capsulatus strain U71. The
library is expressed by U71 transconjugants selected for by
growth on RCV-tetracycline plates at 32°C.

Dope optimization

In computer simulations, various functions were used to optimize
the 'nucleotide mixtures'. In this work, only five functional
mutant sequences were obtained in the zero iteration. Given this
small number of sequences and in order to conserve diversity,
we elected to use the group probability (P_G) function because
it retains all amino acids in the target set. When presented with
a target set at one position, the program 'CyberDope' (provided
courtesy of KAIROS Inc., Cambridge, MA, USA) goes through
all integer nucleotide mixtures possible for a codon and evaluates
for each mixture the value of P_G:

$$P_G = \prod_{i} P_D[i] \qquad (1)$$

where P_D[i] is the frequency of occurrence of the ith amino acid
(in a target set of i amino acids) as encoded by a specific triplet
dope. For the hypothetical target set mentioned above (Ala, Ser
and Thr), any mixture not encoding a member of the target set
(e.g. P_D[Ala] = 0) will cause P_G to be zero. The mixture with
the highest value of P_G will be selected for the dope at that
position. The doped sequence within the cassette used in the first
iteration of REM was

[(G,T)(C,T)(C,G)][(A,G,T)(C,T)(C,G)][(C,T)(C,G)G][(C,T)(C,G)G][(A,G,T)(G,T)G][(C,T,G)(C,T,G)C]
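
The group probability search can be sketched in a few lines. The
code below is not CyberDope and does not reproduce its algorithm;
it is a simplified illustration that assumes each codon position
receives an equimolar mixture of some subset of the four
nucleotides, which is narrower than the 'integer nucleotide
mixtures' mentioned in the text. All function names are
hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Assumption: only equimolar mixtures, i.e. the 15 non-empty base subsets.
MIXTURES = ["".join(c) for r in range(1, 5) for c in combinations(BASES, r)]


def dope_distribution(dope):
    """Amino acid frequencies encoded by a dope such as ("GT", "CT", "CG")."""
    n_codons = len(dope[0]) * len(dope[1]) * len(dope[2])
    dist = {}
    for codon in product(*dope):
        aa = CODON_TABLE["".join(codon)]
        dist[aa] = dist.get(aa, Fraction(0)) + Fraction(1, n_codons)
    return dist


def group_probability(dope, target_set):
    """P_G: product over the target set of each amino acid's frequency."""
    dist = dope_distribution(dope)
    p_g = Fraction(1)
    for aa in target_set:
        p_g *= dist.get(aa, Fraction(0))  # zero if any target member is missing
    return p_g


def best_dopes(target_set):
    """Exhaustively score all equimolar dopes; return the best P_G and all ties."""
    scored = [(group_probability(d, target_set), d)
              for d in product(MIXTURES, repeat=3)]
    best = max(p for p, _ in scored)
    return best, [d for p, d in scored if p == best]


best, ties = best_dopes("FSAL")
print(best, ("TG", "TC", "CG") in ties)
```

Under this simplification ties are common: the dope
[(G,T)(C,T)(C,G)] reported in the Results for the target set F, S,
A, L is one of several equimolar mixtures that reach the same
P_G.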

Imaging spectroscopy

Colonies were imaged as spreads on RCV-tetracycline plates from
the bacteria resuspended after conjugation. The most recent
configuration of the digital imaging spectrophotometer has been
described (Arkin and Youvan, in press). For the fluorescence
images the Petri dishes were illuminated with broad-band
blue-green light and an 830 nm long pass filter was placed in
front of the CCD lens to obtain radiometrically calibrated
monochrome images which were linearly mapped to pseudocolors
after establishing the low and high gray scale values for both
images.

Results 

The experimental complexity (i.e. number of independently
generated clones) of the 'zero iteration' [NN(G,C)]6 library was
approximately 45 000. The theoretical complexity of such a
library at the nucleotide level is calculated as 32^6 (1.1 × 10^9)
because there are 32 possible [NN(G,C)] codons; the
experimental complexity is only a small fraction of this number.
Preliminary screening used fluorescence (Yang and Youvan,
1988), which is indicative of LHII assembly, to rapidly identify
mutants expressing LHII. Mutants are then more closely
evaluated by ground state absorption measurements using DIS.
We observed a low frequency of highly fluorescent colonies in
the zero iteration of REM (ca. one positive mutant in 10 000
colonies screened). Relative to wild type absorption, DIS showed
a decrease in the optical density at 800 and 858 nm for these
few positives.
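
The complexity figures quoted above can be reproduced with a
short calculation. The variable names are illustrative assumptions;
the numbers are those given in the text.

```python
CODONS_PER_SITE = 4 * 4 * 2        # [NN(G,C)]: N, N, then G or C
N_SITES = 6
EXPERIMENTAL_COMPLEXITY = 45_000   # approximate value reported in the text

theoretical = CODONS_PER_SITE ** N_SITES
coverage = EXPERIMENTAL_COMPLEXITY / theoretical

print(f"theoretical complexity: {theoretical:.2e}")
print(f"fraction of the library sampled: {coverage:.1e}")
```

The theoretical complexity comes to about 1.07 × 10^9, and the
45 000 independent clones therefore sample only a few parts in
100 000 of the possible nucleotide sequences.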

Because of their rarity, only five positives were obtained from
the zero iteration of REM. Four of these five mutants fit the
selection criterion of displaying significant absorbance at 858 nm
and another, REM0.10, had an interesting phenotype. The five
positives were repurified and sequenced (Table I). The
composition of a first iteration cassette was calculated by the
computer program 'CyberDope', which generates DNA dopes
that maximize the overall probability of the target set. To add
diversity to the target set, the wild type sequence was also
included. Therefore, while not taking frequency of occurrence
into account because of the small sample size, for the first doped
position the target set is F, S, A, L. The output of CyberDope
at the nucleotide level gave the codon [(G,T)(C,T)(C,G)], which
encodes amino acids A, S, V (0.25 probability of occurrence
for each) and F, L (0.12 probability of occurrence for each).
Valine is unavoidably encoded by this dope because of the
structure of the genetic code.
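
The frequencies quoted for the dope [(G,T)(C,T)(C,G)] can be
checked by enumerating its eight codons. The sketch below assumes
equimolar nucleotides at each position and uses only the part of
the standard genetic code that this dope can produce; the
variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, restricted to the eight codons of this dope.
CODONS = {
    "GCC": "A", "GCG": "A", "GTC": "V", "GTG": "V",
    "TCC": "S", "TCG": "S", "TTC": "F", "TTG": "L",
}

dope = ("GT", "CT", "CG")  # (G,T)(C,T)(C,G)
codons = ["".join(c) for c in product(*dope)]
counts = Counter(CODONS[c] for c in codons)
freqs = {aa: n / len(codons) for aa, n in counts.items()}

for aa in sorted(freqs):
    print(aa, freqs[aa])
```

Ala, Ser and Val are each encoded by two of the eight codons
(0.25), while Phe and Leu are each encoded by one (1/8 = 0.125,
given as 0.12 in the text). The two GT(C,G) codons that introduce
Val cannot be excluded without also losing Ala or Phe and Leu.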

Figure 2 demonstrates the amplification properties of the REM 
methodology as assayed by digital imaging spectroscopy using 
both fluorescence emission and ground state absorption imagery. 
The first iteration of REM yields a 30-fold increase in the 
frequency of enhanced fluorescence mutants (Figure 2A and B).
As compared to zero iteration REM data, DIS analysis of the
first iteration library shows both an increase in the percentage
of positive mutants (i.e. throughput) and an increase in protein

that computer simulations were accurate in their
if an increased throughput of positives, an LHII gene
mutagenized at its six carboxy terminal residues,
zero iteration (CCM) data, target sets of amino acids
A computer encoded algorithm generated a doped
which best represented the target set at each
position. Expression of this new library (the first
REM) revealed a substantial amplification in the

of similar stringency would yield a 30^3 or 27 000-fold
throughput over random mutagenesis using

proteins obtained by combinatorial mutagenesis
necessarily trivial variations of the wild type sequence,
on of a completely conserved motif was observed in
Therefore, the sequence data indicate that REM
recapitulate the known phylogeny. Mechanistically, the
(experimental) randomisation of six sites in a protein
no analogy in nature.
work, experimental evidence is given that REM allows
search of sequence space by producing nullum
increased frequencies of selected 'positives'. Due
high stringency of the region chosen for mutagenesis, only
sequence database was available for the construction of
ion dope. In systems where large complexities can
easily (e.g. phage display libraries), more sites can
at once and more positives isolated, giving a more
sequence database. As a consequence, other dope
(Youvan et al., 1992) could be used which
better suited to yield large increases in throughput,
y, different short stretches of amino acids could be
and the zero iteration data from these libraries pooled
a first iteration dope mutagenizing many more sites
ordinarily possible with CCM.

important to make the connection between our
-based doping schemes and protein engineering
where CCM is currently being used. REM decreases
of null mutants in the population, therefore more sites
simultaneously mutated. Model experiments on LHII can
optimize REM methodology, including the nucleotide
equations. While DIS is limited to screening about I0 h
phage display libraries (Smith, 1985; Hoogenboom
91; Kang et al., 1991) can be used to select mutants
libraries with complexities exceeding I0 U. Based on our
experiments, we expect greater phenotypic diversity

after one iteration of REM. This means that * « «J«
can be isolated, which is the fundamental goal of P^.J*£
Sociology. The use of CCM to introduce additional diversity
K5wyHbn.ri« has already proven a useful approach (Barbas
et al., 1W) and may well be enhanced by the use of >u
mutagenesis when*. REM is the first optimization technique that
is used to address this problem and explore sequence space
in a mathematically rigorous fashion.